Keep ChannelBrokerState in the database #2297

Merged

Conversation

@ysbaddaden (Contributor) commented on Aug 28, 2023:

This allows Surveda to resume operations as if (almost) nothing happened after a restart.

The state is now saved to the database on each operation on the ChannelBrokerState, instead of being kept entirely in memory. If Surveda crashes or needs to be restarted, we don't lose anything, and surveys no longer hang in the situation where the survey broker had already activated enough respondents but the queue was then lost to a restart. This should also reduce memory usage, at the expense of more interaction with MySQL.

The test suite gains dedicated tests for the Ask.Runtime.ChannelBrokerState module to ascertain its behavior in isolation. That behavior changed in a few places to accommodate the new design while still being valid if we change the data store (again).

I replaced the Ecto.Adapters.SQL.Sandbox pool used in the test suite (the Ecto default) with a DatabaseCleaner solution. The sandbox only allows a single connection to be checked out for transactional tests, and that was breaking the test suite: there are long transactions in SurveyBroker and Session that call into the ChannelBroker process, which got blocked because it failed to check out the unique connection (already checked out by another process in a blocking manner). The DatabaseCleaner solution is only marginally slower, and could be made faster by using the MyXQL driver directly.
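For illustration, a rough sketch of what a cleaner-style strategy amounts to in a test case template (this is an assumption for illustration, not the DatabaseCleaner API; the module and table names are hypothetical): each test runs against the regular pool and truncates the tables it touched afterwards, instead of rolling back one sandboxed transaction.

defmodule Ask.CleanerCase do
  # Hypothetical case template sketching the idea, not the actual library API.
  use ExUnit.CaseTemplate

  setup do
    on_exit(fn ->
      # Truncate the tables the test may have written to, so the next test
      # starts from a clean database without a shared sandbox connection.
      for table <- ~w(channel_broker_queue respondents) do
        Ask.Repo.query!("TRUNCATE TABLE #{table}")
      end
    end)

    :ok
  end
end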

⚠️ I wanted to have only one row per channel & respondent (composite primary key of channel_id and respondent_id), but that quickly failed in the test suite, which tries to queue the same respondent multiple times (:sweat:). I fixed it by adopting an UPSERT strategy: whenever there is a conflict, the new contact deactivates and replaces the existing contact in the queue (a sketch follows the two caveats below).

  • In theory this may be incorrect: replacing a queued contact with another queued contact seems fine (forget about the previous attempt), but deactivating an active contact may be going overboard.

  • In practice this may not be an issue: the runtime first asks the channel-broker whether a message is already queued (`has_queued_message?`) or expired (``) before timing out an attempt (see `Ask.Runtime.SurveyBroker.retry_respondent/1` and `Ask.Runtime.Session.timemout/1`).
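A minimal sketch of that upsert in Ecto: the struct fields follow the snippets quoted later in this thread, and these may not be the exact options used in the PR.

# On a duplicate (channel_id, respondent_id) key, MySQL's upsert overwrites the
# existing row, so the previous contact is dropped and this one takes its place.
Repo.insert!(
  %Queue{
    channel_id: state.channel_id,
    respondent_id: respondent.id,
    queued_at: SystemTime.time().now,
    priority: priority
  },
  on_conflict: {:replace_all_except, [:channel_id, :respondent_id]}
)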

@ysbaddaden self-assigned this on Aug 28, 2023
@ysbaddaden force-pushed the feature/keep-channel-broker-state-in-the-database branch from bdd3999 to 50b6267 on August 28, 2023 at 17:26
defp under_capacity?(state) do
-  count_active_contacts(state) < state.capacity
+  Queue.count_active_contacts(state.channel_id) < state.capacity
end
@ysbaddaden (Contributor, Author) commented on Aug 29, 2023:

This method may have a negative impact on performance: we're counting how many contacts are active for the channel right from the database, and we count every time we try to activate a contact.

If it ever becomes a problem, we could keep a counter in state for the number of active contacts, and manipulate it as we activate / increment / decrement / deactivate / reenqueue, instead of counting all the time. We'd only count from the database on init/1.
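A sketch of that optimization, assuming an extra :active_count field in the ChannelBrokerState struct (the function names are illustrative):

# Count once from the database when the broker (re)starts...
def init(state),
  do: %{state | active_count: Queue.count_active_contacts(state.channel_id)}

# ...then keep the cached counter in sync as contacts are (de)activated.
def activate_contact(state),
  do: %{state | active_count: state.active_count + 1}

def deactivate_contact(state),
  do: %{state | active_count: state.active_count - 1}

defp under_capacity?(state),
  do: state.active_count < state.capacity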

@matiasgarciaisaia (Member) replied:

Yes, we definitely should initialise the count from the DB upon start, and then keep track in-memory. Adding an extra query for each respondent we try to contact may be a big overload on the DB - does that query have an easy index, at least?

We can always test to be sure before optimising a bit too early.

@ysbaddaden (Contributor, Author) replied:

Yes, there are a couple of indexes:

  • PRIMARY KEY (channel_id, respondent_id)
  • INDEX (last_contact)

I tried to avoid aggregate queries everywhere, preferring the quick Repo.exists?, but the capacity check is the only query where I needed an actual count.

I didn't implement the optimization: keeping a cache is easy, but keeping it correctly updated is another matter.
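For reference, a migration sketch that would produce those indexes (the table name, column names, and types here are assumptions based on the snippets in this PR):

defmodule Ask.Repo.Migrations.CreateChannelBrokerQueue do
  use Ecto.Migration

  def change do
    # Composite primary key on (channel_id, respondent_id)...
    create table(:channel_broker_queue, primary_key: false) do
      add :channel_id, :bigint, primary_key: true
      add :respondent_id, :bigint, primary_key: true
      add :queued_at, :utc_datetime
      add :priority, :integer
      add :last_contact, :utc_datetime
      add :channel_state, :text
    end

    # ...plus a secondary index on last_contact.
    create index(:channel_broker_queue, [:last_contact])
  end
end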

query =
  from q in Queue,
    select: [:channel_id, :respondent_id, :channel_state],
    where: q.channel_id == ^state.channel_id and q.last_contact < ^last_contact
@ysbaddaden (Contributor, Author) commented:

Hmm, maybe we could ORDER BY last_contact ASC to process the oldest idle contacts first?
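i.e. something along these lines, extending the query quoted above (assuming the same Ecto.Query imports):

query =
  from q in Queue,
    select: [:channel_id, :respondent_id, :channel_state],
    where: q.channel_id == ^state.channel_id and q.last_contact < ^last_contact,
    # Process the oldest idle contacts first.
    order_by: [asc: q.last_contact]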

@ysbaddaden linked an issue on Aug 29, 2023 that may be closed by this pull request
@ysbaddaden marked this pull request as ready for review on August 29, 2023 at 16:59
@matiasgarciaisaia force-pushed the feature/keep-channel-broker-state-in-the-database branch from 9e2dabf to 7a73ed0 on September 4, 2023 at 19:55
@matiasgarciaisaia (Member) left a comment:

I still have to finish reviewing the channel_broker_state_test.exs file.

priv/repo/structure.sql (resolved)

Comment on lines +356 to +359
Repo.all(
  from r in "respondents",
    select: r.id,
    where: r.id in ^State.active_respondent_ids(state) and r.state == "active"
)
@matiasgarciaisaia (Member) commented:

Should we somehow limit this query? Paginate and run multiple times? I'm thinking about full-scale surveys.

@ysbaddaden (Contributor, Author) replied on Sep 5, 2023:

This is already limited to the channel capacity (e.g. 200), not the batch size (e.g. 20,000), since the query restricts to respondents that the channel-broker considers active.

Maybe we could reverse the logic to return inactive respondents instead of active ones? We expect there won't be fewer active respondents than we think there are. Then we could paginate and clean up the respondents from the channel-broker queue in multiple passes. 🤔
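A rough sketch of that multi-pass clean-up (helper names such as State.deactivate_contact/2 are hypothetical, and `import Ecto.Query` is assumed):

defp cleanup_inactive_respondents(state, batch_size \\ 50) do
  inactive_ids =
    Repo.all(
      from r in "respondents",
        select: r.id,
        where: r.id in ^State.active_respondent_ids(state) and r.state != "active",
        limit: ^batch_size
    )

  case inactive_ids do
    [] ->
      state

    ids ->
      # Deactivate this batch (which removes the ids from the active set),
      # then look for the next one.
      ids
      |> Enum.reduce(state, &State.deactivate_contact(&2, &1))
      |> cleanup_inactive_respondents(batch_size)
  end
end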

channel_id: state.channel_id,
respondent_id: respondent.id,
queued_at: SystemTime.time().now,
priority: priority,
@matiasgarciaisaia (Member) commented:

I guess we can't make the same query choose the highest priority between the new one and the one that was already in the DB (in case of an UPDATE), right?

Doesn't really matter, I think, since this shouldn't go through the UPDATE path - but just thinking out loud.

lib/ask/runtime/channel_broker_state.ex (resolved)
Comment on lines +312 to +314
# FIXME: this may take a while, and during that time the channel broker
# won't process its mailbox, if it ever becomes a problem, we might
# consider:
@matiasgarciaisaia (Member) commented:

Yes. If this could block the channel broker for a long time, let's make it run in small bursts (LIMIT 50 and enqueue a new call for this function in the process' mailbox?)
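For instance, assuming the broker is a GenServer and a hypothetical remove_inactive_respondents/2 helper:

def handle_info(:clean_inactive_respondents, state) do
  # Remove at most 50 inactive respondents, then yield back to the mailbox and
  # schedule another pass if this one came back full.
  {removed_count, state} = remove_inactive_respondents(state, 50)

  if removed_count == 50, do: send(self(), :clean_inactive_respondents)

  {:noreply, state}
end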

@ysbaddaden (Contributor, Author) replied:

We could spawn a Task to run the checks in parallel and have a new message to handle each response individually (e.g. :remove_inactive_respondent) 🤔
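Roughly, with hypothetical helper and message names:

broker = self()

Task.start(fn ->
  # Run the (potentially slow) DB checks outside the broker process so its
  # mailbox keeps draining, then report each result back as its own message.
  for respondent_id <- fetch_inactive_respondent_ids(state) do
    send(broker, {:remove_inactive_respondent, respondent_id})
  end
end)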

test/ask/runtime/channel_broker_state_test.exs (three outdated review threads, resolved)
Co-authored-by: Matías García Isaía <[email protected]>
@matiasgarciaisaia (Member) left a comment:

I think we could improve the naming of some models, and there are some rough edges to polish, but overall the PR is OK 👍


Successfully merging this pull request may close these issues.

Keep the ChannelBroker state in the database
2 participants